Using  Stream  Features  for  Instant  Document  Filtering 


Andreas  Bauer  Christian  Wolff 

Media  Informatics  Group  Media  Informatics  Group 
University  of  Regensburg  University  of  Regensburg 
Regensburg,  Germany  Regensburg,  Germany 

andreas.bauer@extern.ur.de  christian.wolff@ur.de


Abstract 

In this paper, we discuss how event processing technologies
can be employed for real-time text stream processing and
information filtering in the context of the TREC 2012
microblog task. After introducing basic characteristics of
stream and event processing, we present the technical
architecture of our text stream analysis engine. We then
discuss how well-known term weighting schemes from
document-centric text retrieval can be applied to temporally
dynamic text streams, and give details of the ESPER Event
Processing Agents (EPAs) we have implemented for this task.
Finally, we describe our experimental setup, report on the
TREC microblog runs as well as on later runs with an extended
version of our system, and give a short interpretation of the
evaluation results.
1.  INTRODUCTION

Due to the rapid growth in user-generated digital content,
data stream processing has received increasing scholarly
attention. Most of this content is textual; investigating
how to rank real-time text streams effectively is therefore
an interesting research question.

In this paper we present an event-based approach to real-time
information filtering as well as the results we obtained for
the microblog real-time filtering task of TREC 2012. In
addition, we discuss improved results, which we achieved
after the TREC 2012 microblog task deadline. We also show
how to leverage (complex) event processing engines for
continuous term weighting and feature generation.

2.  EVENT-BASED  INFORMATION  FILTERING

The basic idea of splitting and mapping text streams onto
semantically distinct event types has already been presented
in [1]. In short, the approach is to feed an incoming
tweet into a network of event processing agents that
immediately execute the analysis and provide an instant ranking
of the tweet. This is possible because modern event processing
engines like Esper1, Tibco BusinessEvents2 or Drools
Fusion3 support high-speed processing of events.

More recently, a streaming version of the big data framework
Hadoop has been released4 and Twitter has published
its real-time streaming system Storm5. In addition, the S4
distributed streaming platform, originally published by Yahoo,
is now an incubator project supported by the Apache
Foundation as well6. All these developments show that
event processing is still needed and is considered a viable and
elementary part of the efficient processing of large amounts
of data. Big Data and event processing are not mutually
exclusive but rather complementary concepts: event processing
addresses interesting goals in analysing large amounts
of data by offering features like sliding windows or pattern
matching.

Streaming and event processing offer major opportunities
for analysing text streams in real time, and the industry is
not only focusing on increasing the speed of analysing large
amounts of data. While the event processing paradigm as
proposed by David Luckham [9] originally focussed on business
applications, its applicability to information retrieval tasks
has been recognized in more recent work [3, p. 10] [10, p. 42].
We have used Esper as our event processing engine of choice,
because it is open source, offers good online support and
allows for a straightforward

1http://www.espertech.com

2http://www.tibco.com/products/event-processing/complex-event-processing/businessevents/default.jsp

3https://www.jboss.org/drools/drools-fusion.html

4http://hadoop.apache.org/docs/r0.15.2/streaming.html

5http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html

6http://incubator.apache.org/s4/
integration of high volume text stream analysis. Some of
the advantages of an event processing approach are:

1.  Clear semantics: Due to the real-time scenario each
digital utterance has to be considered in its temporal
context. Given the sheer amount of information constantly
generated, the probability of a piece of information being
lost, overlooked or ignored increases as time passes.
The event metaphor is well suited for this type of
informational scenario because it stresses the temporal
aspect.

2.  Interoperability: From a design point of view the mapping
of text onto event types allows for easy combination
of events from different sources. For example, if Facebook
status updates and Tweets are mapped onto the same
basic event types like Token Event, Location Event, or
Sentiment Event, cross data stream analysis can easily
be performed, because the same event processing
agents (EPAs) as well as the same event processing
network (EPN) can be used.

3.  Technical integration: Many current big data or large-scale
analysis systems rely on exploiting the temporal
nature of information. Thus, they are designed to
work with events that can be analysed in a temporal
manner by providing temporal meta-information, e.g.
detection time, creation time or expiry time.

3.  TECHNICAL  ARCHITECTURE 

Figure 1 shows the overall technical architecture of the
system we have used for our experiments.

Figure 1: Architecture Overview

Microblogs are imported via a JSON interface and fed into the
text preprocessing component, which includes the following steps:

1.  Conversion  to  lower  case 

2.  Stemming  with  the  Porter  stemmer 

3.  Stop word removal7

7Stop word list used: ftp://ftp.cs.cornell.edu/pub/smart/english.stop

4.  Enriching basic tweet events with statistics on
upper/lower case characters, character count and stop
word count

5.  Language  detection 
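
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the `TweetEvent` fields, the function name `preprocess` and the tiny stop word set are assumptions made for the example, and Porter stemming and language detection are left out because they need external libraries.

```python
import re
from dataclasses import dataclass

# Tiny illustrative stop word set (assumption); the paper uses the SMART list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

@dataclass
class TweetEvent:
    tokens: list          # lower-cased tokens with stop words removed
    upper_count: int      # upper-case characters in the raw text
    lower_count: int      # lower-case characters in the raw text
    char_count: int       # total characters in the raw text
    stop_word_count: int  # number of stop words that were removed

def preprocess(text: str) -> TweetEvent:
    # Case statistics have to be taken before the text is lower-cased.
    upper = sum(1 for c in text if c.isupper())
    lower = sum(1 for c in text if c.islower())
    # Keep a leading '#' or '@' so hashtags and mentions stay recognisable.
    words = re.findall(r"[#@]?\w+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    return TweetEvent(kept, upper, lower, len(text), len(words) - len(kept))
```

A stemmer and a language detector would slot in after stop word removal and before the event is handed to the event processing network.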

After the conversion of raw Tweets into TweetEvents,
the latter are fed into the event processing network, which
runs on the ESPER event processing engine. First, each tweet is
sent to an event processing agent that splits up the text of
the tweet and maps it onto semantically distinct event types.
These events are then processed by several other event
processing agents that calculate stream statistics like average
text length, count of distinct tokens, term counts, etc.

Of particular importance are the SearchProfileEPAs, one of
which was created for every TREC topic. The filtering, which
was based on different ranking schemes, was done within these
EPAs. To fulfil the real-time requirement, the number of Tweets
to be examined was decreased by a first filtering step that
prevents many Tweets from entering the actual filtering stage.
This first filtering step has a major influence on the recall
of the system, because it acts as a gatekeeper to the filtering
processes: a strict filter condition, for instance, increases
precision but lowers recall. For the runs we present in this
paper the primary filtering rule is shown in condition 1.

Filter = \{\, x \mid qm > 1 \;\lor\; stl = 1 \;\lor\; (qm > 0 \,\land\, fm > 1) \,\}  (1)

Here qm counts the matches between topic terms and terms
in the Tweet, stl indicates a search profile with only one search
term, and fm counts feedback matches, i.e. feedback terms
encountered in an incoming Tweet.
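
A literal reading of condition 1 can be sketched as a predicate. The function and parameter names are hypothetical, and treating qm and fm as counts of distinct matching terms and stl as a 0/1 flag is an assumption; the condition itself does not say how the three values are computed.

```python
def passes_prefilter(tweet_tokens, query_terms, feedback_terms):
    """Gatekeeper check following condition 1 literally (a sketch)."""
    tokens = set(tweet_tokens)
    qm = len(tokens & set(query_terms))      # query term matches (assumed: a count)
    fm = len(tokens & set(feedback_terms))   # feedback term matches (assumed: a count)
    stl = 1 if len(set(query_terms)) == 1 else 0  # single-term search profile
    # Read literally, stl = 1 lets every Tweet through for single-term profiles.
    return qm > 1 or stl == 1 or (qm > 0 and fm > 1)
```

For a three-term profile, a Tweet therefore needs either two query term matches or one query term match plus two feedback term matches to reach the ranking stage.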

4.  USING  STREAM  FEATURES  FOR 
RANKING 

The stream features described in the previous section are
now processed using different ranking algorithms.
For the TREC 2012 runs, we were only able to provide two
algorithms, one based on OkapiBM25 and the other a standard
Vector Space Model-based approach. For each scheme
we have provided one run with and one run without relevance
feedback. In later publications we will present the application
of stream based features using BursT, TF/ICF and incremental
TF/IDF. Furthermore, we will present the results of plugging
stream based TF/IDF values into weighting schemes
like ATC and LTU [5, p. 4]. Here, we would like to find
out whether stream based features calculated by applying
sliding time windows to the data streams offer reasonable
results.

This is relevant because this approach emphasizes the
temporal sensitivity of events, i.e. a streamed approach
better reflects the continuous nature of a text stream.

4.1  Components  of  a  Document  Ranking 
Scheme 

[16, p. 517f] state that there are three relevant components
within a term weighting scheme: Term frequency describes
what a document is about. The second component
is a factor that reflects the distribution of terms in the
document collection as a whole, while the third factor takes into
account the length of a document in order to avoid over- or
under-representing terms in the weighting scheme (document
length normalization). All of these components were
taken into account and adjusted for the real-time filtering
scenario. In the following subsections we discuss how concepts
like term, document, document collection and their
relationships have to be adjusted for the microblog scenario and
its streaming and real-time aspects.

4.1.1  Determining  term  frequency  in  text  streams 

[12, p. 2] have shown that in 85% of all Tweets every
term occurs only once. Hence the traditional term frequency
measures that rely on the term count within a document fail,
because the tf value would be the same for almost every token
within a tweet. To overcome this problem we have used
stream statistics to derive a term frequency replacement
value that reflects the importance of a term. As mentioned
above we use sliding time windows: each window looks into
the past and reflects the last k seconds, minutes or events
of the data stream. We think that this approach best reflects
the temporal notion of information streams.
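
The idea of a sliding time window over a token stream can be illustrated with a few lines of Python. This is only a sketch of the concept, not the Esper implementation used in the system; the class name `SlidingTopK` is hypothetical, and the sketch assumes that events arrive with non-decreasing timestamps.

```python
from collections import Counter, deque

class SlidingTopK:
    """Token counts over the last `window_sec` seconds of a stream."""

    def __init__(self, window_sec: float):
        self.window_sec = window_sec
        self.events = deque()    # (timestamp, token), oldest first
        self.counts = Counter()

    def add(self, ts: float, token: str) -> None:
        self.events.append((ts, token))
        self.counts[token] += 1
        self._expire(ts)

    def _expire(self, now: float) -> None:
        # Drop events that have left the window and undo their counts.
        while self.events and self.events[0][0] <= now - self.window_sec:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top_k(self, k: int):
        # Most frequent tokens currently inside the window.
        return self.counts.most_common(k)
```

The rank of a token in such a top-k list is the kind of stream statistic that can stand in for the within-document term frequency.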

For the runs we submitted to TREC 2011 we have used
a sliding time window of 120 seconds to calculate ad-hoc
statistics for each event stream. Assuming that the approx.
4.5 million usable Tweets (out of 12 million downloaded) are
distributed equally over 16 days, we get an average event
frequency of approx. 300k events per day. Assuming a constant
event arrival rate, this means that only about 3 to 4 events
arrive per second.

In the set-up for the TREC runs we have set the event arrival
rate to 500 events per second, because a run would have taken
the full 16 days if we had kept the original arrival gaps.
The time window of 120 seconds, which we used for the TREC
runs, thus corresponds to a real world window of approx.
4 hours. This value was chosen arbitrarily and will be subject
of further investigations. In general, we can assume that the
smaller a time window is, the better it reflects the most
recent changes.

For the post-TREC runs (runs we created after the TREC
deadline) we changed the rate to 1200 events per second.
The best results, which will be presented in Section 5.1.1,
were obtained with a time window of 240 seconds. Extrapolated
to real time, this corresponds to a sliding time window of
almost one day.
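
The extrapolation from replay time to real-world time can be retraced with a short calculation. The helper names below are hypothetical; the figures (4.5 million usable Tweets, 16 days, replay rates of 500 and 1200 events per second, windows of 120 and 240 seconds) are taken from the text.

```python
SECONDS_PER_DAY = 86_400

def real_rate(usable_tweets, days):
    """Average real-world arrival rate in events per second."""
    return usable_tweets / (days * SECONDS_PER_DAY)

def real_world_window_hours(replay_rate, window_sec, real_rate_per_sec):
    """Real-world time span covered by a replay window, in hours."""
    events_in_window = replay_rate * window_sec
    return events_in_window / real_rate_per_sec / 3600

def replay_duration_hours(usable_tweets, replay_rate):
    """How long a full replay of the corpus takes, in hours."""
    return usable_tweets / replay_rate / 3600

# With the rounded rate of 4 events per second:
trec_window = real_world_window_hours(500, 120, 4.0)    # about 4.2 hours
post_window = real_world_window_hours(1200, 240, 4.0)   # 20 hours
```

With the unrounded rate returned by `real_rate(4_500_000, 16)` (about 3.26 events per second) the two windows come out at roughly 5 and 25 hours, so the figures in the text should be read as order-of-magnitude estimates. `replay_duration_hours(4_500_000, 1200)` gives slightly more than one hour for a complete post-TREC replay.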

We use the sliding time windows to build dynamic statistics
for the different event types of the stream: we have built
a top-k window for the hashtag and the token event stream. For
the presented runs we only used hashtags and tokens for
calculating a local term weighting factor, because hashtags are
"derived" from regular tokens and hence can be used for
the boolean matching step that is conducted in order to
determine the terms that should be weighted.

Twitter has further special semantics, like retweets
(rtusername), mentions (@username) or embedded links, that
could be used for filtering the stream, but this is still
subject to further investigation.

Listing 1 shows how to construct the top-k ranking for
retweets8. Each insert into statement can be considered as
a separate EPA. Esper offers the possibility to subscribe to
such statements, and Java code can then be executed on the
events that are being processed by the EPA.


8Listing 1 shows how the top-k retweets are generated.


Stream          Weight
Token Stream    1.0
Hashtag Stream  1.5

Table 1: Stream weights for TREC 2012 runs


Listing 1: Building an EPA in Esper

// Named window keeping one row per token for the statistics window
create window rtcount
    .win:time(const_stats_window sec)
    .std:unique(token)
as (token String
  , cnt Long
  , type String
  , ts Long);

// Count retweets per token within the sliding time window
insert into rtcount_raw
select istream token
  , count(*) as cnt
  , 'RtCount' as type
  , current_timestamp() as ts
from RetweetEvent
    .win:time(const_stats_window sec)
group by token
having count(*) > 0
output every var_output sec;

// Copy the periodic counts into the named window
insert into rtcount
select istream token
  , cnt, 'RtCount' as type
  , current_timestamp() as ts
from rtcount_raw s
where cnt > 0;

// Emit the top_k most frequent tokens at every output interval
insert into topKrt
select token, cnt
from rtcount
output every var_output sec
order by cnt desc limit top_k;

For the TREC runs the term frequency tf is then
calculated using the following formula:

tf_{w,t} = 1 + TW_{i,t} \cdot \frac{1}{\log_2(rank_t + K)}  (2)

with TW_{i,t} being an event stream specific weighting value.
For the runs we use the values shown in Table 1. The values
are heuristically chosen, based on observations of Twitter,
e.g. [19] or [2].


Discussion  on  real  time  filtering  corpora. 

Note that we compressed the text stream for our runs
from a real world 16 day period to a continuous one hour
data stream, because there is no separate high-volume
microblog corpus available for the TREC 2012 filtering task.

This raises the question of how representative the results
are in comparison to the real world Twitter stream,
because in our scenario real world events of 16 days are
simulated to happen within one hour. Hence the real Twitter
stream might offer a different density and distribution of
terms than the TREC corpus. This will be subject to further
research and discussion, but we think that our approach
is valid, because experiments with smaller time windows in
the post-TREC runs also showed comparable results. Due
to space limitations we postpone this discussion to a later
paper.

4.1.2  Determining  document  collection  features  in 
text  streams 

For determining the document collection based features
we employ sliding time windows as well. Sliding time windows
were also used for the BursT weighting scheme [7],
which supports the viability of our approach. Again, we
use the time window to calculate a streamed inverse document
frequency (sIDF) value that is used as the document
collection value. The formula is the same as proposed
in [6] and explained in [14, p. 504], but it is based on the
document count N and term count n in the sliding time
window w:

sIDF_{w,t} = \log_2 \frac{N_w}{n_w}  (3)
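
As a small worked example of equation 3, the following sketch computes the sIDF value from the two window statistics. The function and parameter names are hypothetical, and raising an error for a term that does not occur in the window is an assumption of this sketch; how the system handles that case is not stated.

```python
import math

def sidf(docs_in_window: int, term_count: int) -> float:
    """Streamed IDF log2(N_w / n_w) for the current sliding window."""
    if term_count <= 0:
        # Assumption: a term that is absent from the window has no sIDF.
        raise ValueError("term does not occur in the current window")
    return math.log2(docs_in_window / term_count)
```

A term seen once among 1024 Tweets in the window gets an sIDF of 10, while a term that occurs in every Tweet of the window gets 0, so the value changes as the window slides over the stream.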

The final score is then simply calculated by plugging the
features into the chosen ranking method. For the TREC
runs we used a Vector Space Model and OkapiBM25. For
the post-TREC runs we modified OkapiBM25 slightly and
additionally used the ATC method [5].

4.2  Pseudo  Relevance  Feedback  for  Query 
Expansion  and  External  Evidence 

For the TREC runs no external evidence was used, i.e.
no data from Wikipedia or from search engines was used,
nor were URLs contained in a Tweet resolved in order to
adjust the search profiles. A search profile had the
structure shown in Listing 2, and Tweets were only
considered if they had a tweet id between querytweettime
and querynewesttweet.

Listing 2: Sample search profile TREC filtering tree

<top>
<num> MB049 </num>
<title> carbon monoxide law </title>
<querytime>
Tue Feb 01 22:44:23 +0000 2011
</querytime>
<querytweettime>
32005451423948800
</querytweettime>
<querynewesttweet>
32569981321347074
</querynewesttweet>
</top>
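
Reading such a profile and applying the tweet id restriction could look as follows. This is a sketch using Python's standard `xml.etree.ElementTree` module; the function names and dictionary keys are hypothetical, and treating both id bounds as inclusive is an assumption.

```python
import xml.etree.ElementTree as ET

PROFILE = """<top>
<num> MB049 </num>
<title> carbon monoxide law </title>
<querytime> Tue Feb 01 22:44:23 +0000 2011 </querytime>
<querytweettime> 32005451423948800 </querytweettime>
<querynewesttweet> 32569981321347074 </querynewesttweet>
</top>"""

def load_profile(xml_text):
    """Parse one <top> element into a plain dictionary."""
    top = ET.fromstring(xml_text)
    return {
        "num": top.findtext("num").strip(),
        "terms": top.findtext("title").split(),
        "first_id": int(top.findtext("querytweettime")),
        "last_id": int(top.findtext("querynewesttweet")),
    }

def in_scope(profile, tweet_id):
    # Assumption: both bounds are inclusive.
    return profile["first_id"] <= tweet_id <= profile["last_id"]
```

Because tweet ids grow over time, the id range acts as the time frame of the topic without the timestamps having to be parsed.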

It has been shown ([11], [8]) that expanding a URL contained
in a Tweet can improve the retrieval and ranking
performance of algorithms. This approach was not applied
here, as the delay between arrival and judgement of a tweet
should be kept as small as possible. The resolution of
a URL, its parsing and analysis would have introduced an
additional time gap that was not acceptable. In an end-user
scenario where a real user does not require immediate
assessment of new Tweets, this constraint may be loosened and
external evidence from a given URL might be included. As
mentioned above, in this experiment we focus on the
immediately available textual information only.

In two TREC runs pseudo-relevance feedback was included
as follows: The system starts without any additional
query terms. While the system is running, new Tweets arrive.
If an arriving Tweet is relevant according to the judgements
provided by the TREC board, the terms of this Tweet are
incorporated into the search profile. The top 5 tokens are
added dynamically to the search profile.

This approach had the drawback that search profiles
got dominated by general terms9 that diluted the search
profile. Despite this, one Okapi run submitted to TREC
was ranked #13 out of 69 submitted runs [18].

The relevance feedback mechanism was improved for
the post-TREC runs. The positive and negative feedback
Tweets were saved in the search profile. During the filtering
phase all negative feedback terms were subtracted from the
positive ones, without any ranking or selection process.
Only the remaining terms, weighted by a factor α, were used
in addition to the original query terms. This subtraction
method was the reason for the remarkable increase in the
performance of the system. In further research we will try
to describe in detail why this worked and whether it only
worked by chance for this scenario.
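
One way to read the subtraction method is as a set difference between the terms of the positive and the negative feedback Tweets. The sketch below follows that reading; the function name, the weight of 1.0 for original query terms and the default value of `alpha` are assumptions chosen for illustration, not values reported for the system.

```python
def expanded_query(query_terms, positive_tweets, negative_tweets, alpha=0.5):
    """Query term weights after subtraction based relevance feedback."""
    positive = {t for tweet in positive_tweets for t in tweet}
    negative = {t for tweet in negative_tweets for t in tweet}
    # Original query terms keep full weight (assumption: 1.0).
    weights = {t: 1.0 for t in query_terms}
    # Terms seen only in positive feedback are added with weight alpha.
    for term in positive - negative:
        weights.setdefault(term, alpha)
    return weights
```

Under this reading, general terms that occur in both positive and negative Tweets are removed, which addresses the dilution problem of the naive approach.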

Without the new feedback method the result stayed
at a T11SU of around .33, but with it the T11SU went up to .47.
We also tried a Rocchio based feedback and incorporated a
decay factor to re-weight the terms in the feedback set.
Neither yielded results comparable to the subtraction method:
the result stayed around a T11SU value of .33.

4.3  Relevance  Decision 

The relevance decision and thus the setting of the decision
threshold are the two main aspects that substantially
influence the performance of an information filtering system.

The relevance decision for the TREC runs was made as follows.
The guidelines of the microblog filtering task asked for
a retrieval decision for every retrieved Tweet. To provide
one, we built an ideal document by adding the missing
search terms to the Tweet under inspection and calculated
the score for this ideal document as well. We thus had two
scores that could be used to generate a ratio. For the TREC
runs we used 0.5 as the threshold, i.e. every Tweet scoring
more than 0.5 was marked as relevant.
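
The ratio between the score of a Tweet and the score of its ideal document can be sketched as follows. The scoring function is pluggable; `overlap_score` is only a stand-in invented for this example and is not one of the ranking schemes used in the runs.

```python
def overlap_score(tokens, query_terms):
    # Stand-in scorer (assumption): number of distinct query terms present.
    return len(set(tokens) & set(query_terms))

def relevance_ratio(tokens, query_terms, score=overlap_score):
    """Score of the Tweet divided by the score of its ideal document."""
    missing = [t for t in query_terms if t not in tokens]
    ideal = list(tokens) + missing   # the Tweet plus all missing search terms
    ideal_score = score(ideal, query_terms)
    if ideal_score == 0:
        return 0.0
    return score(tokens, query_terms) / ideal_score

def is_relevant(tokens, query_terms, threshold=0.5, score=overlap_score):
    return relevance_ratio(tokens, query_terms, score) > threshold
```

With the stand-in scorer, a Tweet matching two of three search terms gets a ratio of 2/3 and passes the 0.5 threshold, while a Tweet matching only one of three does not.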

For the post-TREC runs we used a different approach that
exploits a further stream characteristic. We calculated
the average score of the Tweets marked as positive in
the sliding time window. If the score of an incoming Tweet
exceeded this average of the positive feedback samples, the
Tweet was marked relevant. This approach measurably
increased all performance measures.
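
A dynamic threshold of this kind can be sketched with a time-ordered queue of positive scores. The class and method names are hypothetical, timestamps are assumed to be non-decreasing, and rejecting Tweets while the window holds no positive sample is an assumption of this sketch.

```python
from collections import deque

class DynamicThreshold:
    """Average score of positive Tweets within a sliding time window."""

    def __init__(self, window_sec: float):
        self.window_sec = window_sec
        self.positives = deque()   # (timestamp, score), oldest first

    def add_positive(self, ts: float, score: float) -> None:
        self.positives.append((ts, score))

    def is_relevant(self, ts: float, score: float) -> bool:
        # Forget positive samples that have left the window.
        while self.positives and self.positives[0][0] <= ts - self.window_sec:
            self.positives.popleft()
        if not self.positives:
            return False   # assumption: no positive evidence, no acceptance
        average = sum(s for _, s in self.positives) / len(self.positives)
        return score > average
```

Unlike the fixed threshold of 0.5, the acceptance level here moves with the scores of recently confirmed Tweets.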

In order to verify this we did a post-hoc evaluation of our
runs and determined the best T11SU value by increasing the
threshold step by step. This retrospective approach yielded
a fixed threshold; for ReverseOkapi, for example, the best
T11SU value was around .46. While precision was almost equal,
the dynamic average approach yielded far better recall
values. This is plausible, as a dynamic threshold always
reflects the current situation of the stream and hence adapts
well to changing situations.

9In [?, p. 8] general terms are defined as occurring in
positive as well as in negative documents

5.  RUN  RESULTS 

In this section we present the results for the TREC
2012 and the post-TREC runs. Furthermore, we provide
a comparison to an incremental corpus approach based on
Lucene10, i.e. the arriving Tweets were constantly added to
the Lucene index and scored with the custom Lucene scoring
as well as with a custom OkapiBM25 implementation.

5.1  Experimental  Setup  and  Data 

The data for the experiments is the TREC 2011 microblog
corpus11. It is one of the last microblog corpora that are
still freely available. Due to its copyright rules Twitter does
not allow third parties to provide closed sets of Tweets for
research or similar purposes. For example, the Edinburgh
Twitter Corpus [13] was a scientifically edited corpus but is
not available any more due to the aforementioned restrictions.

The TREC 2011 corpus is not available for free download
as a Twitter corpus either. TREC only provides the ids of the
Tweets that are used for the TREC conference. The Tweets can
be downloaded with a tool that either crawls the HTML
page of the Tweet with the given ID or downloads a JSON
version of the corresponding Tweet. The latter makes use
of Twitter API calls, which are usually very restricted (e.g.
150 per hour); this dramatically slows down the download
process. We have used the HTML version as this allowed us
to download the data in a reasonable amount of time.

In total, there are 16 million Tweet ids available. But
Twitter is constantly moving its data, so it is not assured
that each Tweet ID provided by TREC can be downloaded. This
in turn means that every research group obtains a different
corpus depending on the time Twitter was crawled. We
downloaded the data from Twitter between October 10 and 19,
2011.

In total we were able to download approx. 12 million
Tweets. For our runs, only English Tweets were considered.
For this purpose we used the language detection library
developed by [17].

We also removed Tweets containing more than four question
marks ('???'), because many Tweets contained only question
marks due to their encoding12. Out of the approx. 16 million
ids provided by the TREC board we could use approx.
4.5 million. The proceedings of the TREC conference13 show
that this is about the average number of Tweets that could
be used effectively.

5.1.1  Evaluation 

In  total,  we  have  submitted  four  runs  to  TREC  2012: 
Two  runs  based  on  OkapiBM25  and  two  using  a  Vector 
Space  Model  approach,  each  with  and  without  relevance 

10https://lucene.apache.org/core/

11Access for academic purposes can be requested here:
http://trec.nist.gov/data/Tweets/

12Foremost Asian languages were not correctly retrieved by
the crawler

13The TREC proceedings will probably be available in the
first quarter of 2013: http://trec.nist.gov/proceedings/proceedings.html


feedback. After the TREC deadline we continued experimenting.
These results are also shown here. We compare our
results with the best run from the TREC microblog filtering
track in terms of T11SU, f-measure, precision and recall.
The runs presented in this section were evaluated against
the relevance assessments provided by the TREC board14.
The qrel15 value 2 was mapped onto value 1 to obtain the
binary case. In total 60129 Tweets had been assessed, out of
which 116 were considered very bad (-2), 57048 not relevant
(0), 2404 relevant (1) and 561 highly relevant (2).

The evaluation measures used are described in [15]. All
results are sorted by their T11SU value, which is a utility
oriented evaluation measure [4, p. 3] and the standard
evaluation measure in the filtering tasks of TREC.
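
For orientation, the two filtering measures can be written down in a few lines. The sketch assumes the definitions commonly given for the TREC 2002 filtering track (linear utility 2·R+ − N+, normalised by the maximum utility and scaled with a lower bound of −0.5); the paper itself defers to [15] and [4] for the exact definitions, and the function and parameter names are hypothetical.

```python
def t11su(relevant_retrieved, nonrelevant_retrieved, total_relevant, min_nu=-0.5):
    """Scaled linear utility for one topic (assumed TREC 2002 definition)."""
    if total_relevant <= 0:
        raise ValueError("topic needs at least one relevant document")
    utility = 2 * relevant_retrieved - nonrelevant_retrieved   # T11U
    normalised = utility / (2 * total_relevant)                # T11NU
    return (max(normalised, min_nu) - min_nu) / (1 - min_nu)

def f_beta(precision, recall, beta=0.5):
    """F-measure that weights precision higher than recall for beta < 1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Under this definition a system that retrieves nothing for a topic scores 1/3, which is a useful reference point when reading T11SU values.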

Only after submitting the runs to TREC did we find out
that the Vector Space Model (VSM) results had been corrupted,
which became apparent through their poor performance.
The reasons for this are under investigation. The
okapiv1 and okapiv2rel runs performed quite well, with okapiv1
being our best run. This run made it to number 13 out of
69 submitted runs.

Among the TREC runs, the ones without relevance feedback
did better than the ones with relevance feedback. This is
due to the naive feedback approach used for the TREC runs
(Section 4.2), which diluted the search profile. The search
profiles without feedback stayed more concise and hence
performed better.

The naive feedback approach was changed for the post-TREC
runs (reverse okapi, atc) and this improved the result
significantly. Besides changing the feedback mechanism
we experimented with different window sizes. Increasing the
time window from 10 to 120 seconds also showed better
performance16.

Furthermore  the  weight  for  hashtags  while  determining 
the  local  term  weight  was  increased  from  1.5  to  12.  The 
ratio  between  tokens  and  hashtags  is  quite  skewed,  so  the 
hashtag  weight  would  not  contribute  significantly  if  we  kept 
the  weight  so  low. 

The aforementioned adjustment of the feedback mechanism
and the dynamic threshold helped to increase the precision
of the post-TREC runs. The experimentation with different
ranking schemes also showed interesting results.
We tried two versions of Okapi: one with regular document
length normalization17 and one with reverse normalization18.
The goal of the latter was to improve the ranking
for longer Tweets. This yielded remarkably better results
than regular Okapi. Furthermore, we tried a classic Vector
Space Model approach, as well as an approach (RSV) based
only on the document collection features, i.e. only the
idf values. Finally, in order to show the effectiveness of the
event based approach we did some comparison runs based
on Lucene. We used Lucene for indexing and retrieving
incremental term statistics (document and token count) and
calculated a cosine similarity measure as well as a custom

14http://trec.nist.gov/data/microblog/11/microblog11-qrels
(access restricted; registration required)

15Relevance assessment value provided by the TREC board

16Due to space reasons we only show the values for the
adjusted relevance feedback and changed window size

17 norm = \frac{doclength}{averagedoclength}

18 norm = \frac{averagedoclength}{doclength}


OkapiBM25 one. Both performed considerably worse than
the event based approach using sliding windows.

Table 3 shows the concrete numbers for the experiments
conducted after the TREC submission deadline. Table 4
shows the runs without the subtraction based relevance
feedback mechanism.

6.  CONCLUSION  AND  OUTLOOK 

Few results submitted to TREC come close to the best
value for a specific task, but many of the results are above
the median. For the post-TREC runs the performance
was increased. This was foremost due to the adjustment
of the window size and the improvement of the relevance
feedback mechanism. Both results can be interpreted as a
confirmation of the viability of the general approach of
employing an event processing engine for microblog stream
processing. Besides error correction we will focus on the
analysis of additional measures of comparison. We will also
compare different strategies for constructing the temporal
corpus, where we will investigate a dynamic adjustment of the
window size depending on evaluation metrics like precision,
recall or f-measure. Additionally, we want to investigate how
to incorporate the context of the search terms in order to
improve the retrieval quality and increase recall. Finally, we
will contrast sliding windows with incremental approaches
and investigate how to efficiently set the decision threshold
in order to maximize the performance of the system.
Run id             Precision  Recall  F-Measure@.5  T11SU   Decision Threshold
okapiv1 (TREC)     0.3370     0.1024  0.3338        0.1916  0.5
okapiv2rel (TREC)  0.2831     0.1486  0.2978        0.1942  0.5
vsmv1 (TREC)       0.1217     0.0732  0.2690        0.0616  0.5
vmsv2rel (TREC)    0.1411     0.0518  [illegible]   0.0835  0.5

Table 2: TREC run summary


#  scoring_function    prec    recall  f_measure  t11su   accuracy  specificity  tp   tn     fp    fn
1  VSM                 0.4074  0.4221  0.3783     0.3391  0.8945    0.9089       771  30191  3026  625
2  tfidf               0.6199  0.2115  0.3737     0.4155  0.9275    0.9541       404  15165  729   488
3  RSV                 0.6060  0.3348  0.4764     0.4477  0.9559    0.9786       568  32418  710   813
4  reverse_okapi       0.7198  0.3482  0.5518     0.5148  0.9677    0.9904       594  32916  318   802
5  okapi_stream_count  0.5977  0.3437  0.4775     0.4504  0.9557    0.9783       583  32515  720   813
6  atc                 0.4861  0.3438  0.4188     0.3898  0.9470    0.9641       675  31908  1189  634
7  lucene_tfidf_incr   0.3514  0.1364  0.2207     0.3027  0.8761    0.9053       307  14392  1506  573
8  lucene_okapi        0.3288  0.1953  0.2274     0.2824  0.6833    0.6941       430  11035  4863  450

Table 3: Post-TREC runs using relevance feedback - summary


#  scoring_function    prec    recall  f_measure  t11su   accuracy  specificity  sum(tp)  sum(tn)  sum(fp)  sum(fn)
1  atc                 0.2725  0.1819  0.2015     0.2727  0.5122    1            448      2082     1968     441
3  okapi_stream_count  0.2726  0.3282  0.2413     0.2469  0.7055    1            821      9114     3658     489
4  reverse_okapi       0.3026  0.3483  0.2711     0.2681  0.7595    1            801      9893     2878     509
5  RSV                 0.3497  0.1717  0.2465     0.3051  0.7143    1            444      3084     966      445
6  tfidf               0.3227  0.1890  0.2295     0.2780  0.5868    1            518      2380     1670     371
7  VSM                 0.3150  0.1667  0.2026     0.2719  0.5386    1            407      2253     1797     482

Table 4: Post-TREC runs without relevance feedback - summary


